                                        April 1 1973

        A Proposal for Speech Understanding Research


        It is proposed that the work on speech recognition that is now under way in the A.I. project at
Stanford University be continued and extended as a separate project with broadened aims in the field of
speech understanding.

        It is further proposed that this work be more closely tied to the ARPA Speech Understanding
Research groups than it has been in the past and that it have as its express aim the study and application
to speech recognition of a machine learning process that has proved highly successful in another
application and that has already been tested out to a limited extent in speech recognition.  The machine
learning process offers both an automatic training scheme and the inherent ability of the system to adapt
to various speakers and dialects.  Speech recognition via machine learning represents a global approach to
the speech recognition problem and can be incorporated into a wide class of limited vocabulary systems.
Ultimately we would like to have a system capable of understanding speech from an unlimited domain of
discourse and with unknown speakers.  It seems not unreasonable to expect the system to deal with this
situation very much as people do when they adapt their understanding processes to the speaker's
idiosyncrasies during the conversation.

        With so much of the current work on speech understanding being devoted to the development of
systems designed to work in a limited field of discourse and with a limited number of speakers, it seems
desirable for a minimal program to be continued that is not so restricted.  It is felt that we should not lose
sight of those aspects of the problem that are for the moment peripheral to the immediate aims of
developing the best complete system that can currently be built.  Stanford University is well suited as the
site for such work, having both the facilities for this work and a staff of people with experience and
interest in machine learning, phonetic analysis, and digital signal processing.

        The initial thrust of the proposed work would be toward the development of adaptive learning
techniques, using the signature table method and some more recent variants and extensions of this basic
procedure.  We have already demonstrated the usefulness of this method for the initial assignment of
significant features to the acoustic signals.  One of the next steps will be to extend the method to include
acoustic-phonetic probabilities in the decision process.  Finally we would hope to take account of
syntactical and semantic constraints in a somewhat analogous fashion.

        Still another aspect to be studied would be the amount of preprocessing that should be done and
the desired balance between bottom-up and top-down approaches.  It is fairly obvious that decisions of
this sort should ideally be made adaptively depending upon the familiarity of the system with the current
domain of discourse and with the characteristics of the current speaker.  Compromises will undoubtedly
have to be made in any immediately realizable system but we should understand better than we now do
the limitations on the system that such compromises impose.

        Finally we would propose accepting responsibility for keeping other related projects supplied with
operating versions of the best current programs that we have developed to interface the output from the
digitized speech, or from a frequency-domain representation of this input, to the rest of the overall system.

        It may be well at this point to describe the general philosophy that has been followed in the work
that is currently under way and the results that have been achieved to date.  We have been studying
elements of a speech recognition system that is not dependent upon the use of a limited vocabulary and
that can recognize continuous speech by a number of different speakers.

        Such a system should be able to function successfully either without any previous training for the
specific speaker in question or after a short training session in which the speaker would be asked to
repeat certain phrases designed to train the system on those phonetic utterances that seemed to depart
from the previously learned norm.  In either case it is believed that some automatic or semi-automatic
training system should be employed to acquire the data that is used for the identification of the phonetic
information in the speech.  We believe that this can best be done by employing a modification of the
signature table scheme previously described.  A brief review of this earlier form of signature table is
given in Appendix 1.

        The over-all system is envisioned as one in which the more or less conventional method is used of
separating the input speech into short time slices for which some sort of frequency analysis, homomorphic,
LPC, or the like, is done.  We then interpret this information in terms of significant features by means of a
set of signature tables.  At this point we define longer sections of the speech called EVENTS which are
obtained by grouping together varying numbers of the original slices on the basis of their similarity.  This
then takes the place of other forms of initial segmentation.  Having identified a series of EVENTS in this
way we next use another set of signature tables to extract information from the sequence of events and
combine it with a limited amount of syntactic and semantic information to define a sequence of phonemes.
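
        The following minimal sketch (in present-day Python, with invented helper names and an invented
similarity criterion; it is an illustration of the intended flow of data, not the actual program) may make
the slice-to-event-to-phoneme organization concrete.

def similar(a, b, tol=1):
    """Crude similarity test between two feature vectors (criterion assumed for illustration)."""
    return all(abs(x - y) <= tol for x, y in zip(a, b))

def recognize(slices, slice_tables, event_table):
    # 1. Interpret each analysed time slice as a vector of significant features.
    features = [tuple(t[s] for t in slice_tables) for s in slices]

    # 2. Group contiguous, similar slices into longer sections (EVENTS); this
    #    grouping takes the place of other forms of initial segmentation.
    events = []
    for f in features:
        if events and similar(events[-1][-1], f):
            events[-1].append(f)
        else:
            events.append([f])

    # 3. A further table maps each event (here summarized by its averaged
    #    features) to a phoneme estimate.
    average = lambda ev: tuple(sum(col) // len(ev) for col in zip(*ev))
    return [event_table[average(ev)] for ev in events]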


        Signature tables can be used to perform four essential functions that are required in the automatic
recognition of speech.  These functions are: (1) the elimination of superfluous and redundant information
from the acoustic input stream, (2) the transformation of the remaining information from one coordinate
system to a more phonetically meaningful coordinate system, (3) the mixing of acoustically derived data
with syntactic, semantic and linguistic information to obtain the desired recognition, and (4) the
introduction of a learning mechanism.

        The following three advantages emerge from this method of training and evaluation.
        1) Essentially arbitrary inter-relationships between the input terms are taken into account by any one
table.  The only loss of accuracy is in the quantization.
        2) The training is a very simple process of accumulating counts.  The training samples are
introduced sequentially, and hence simultaneous storage of all the samples is not required.
        3) The process linearizes the storage requirements in the parameter space, as the rough calculation
below illustrates.
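
        The arithmetic below is our own after-the-fact illustration of point 3, using the checker-playing
configuration described in Appendix 1: an exhaustive table over all 27 quantized terms would need the
product of their ranges, while the layered tables need only the sum of the individual table sizes.

ranges = [7, 5, 3] * 9           # 27 input terms quantized to 7, 5 and 3 levels

one_big_table = 1
for r in ranges:
    one_big_table *= r           # 105**9, roughly 1.6e18 entries

# 9 first-level tables of 105 entries, 3 second-level and 1 third-level table of 125 entries each
hierarchy = 9 * (7 * 5 * 3) + 3 * 5 ** 3 + 1 * 5 ** 3   # 945 + 375 + 125 = 1445 entries
print(one_big_table, hierarchy)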

        The signature tables, as used in speech recognition, must be particularized to allow for the
multi-category nature of the output.  Several forms of tables have been investigated.  Details of the
current system are given in Appendix 2.  Some results are summarized in an attached report.


        Work is currently under way on a major refinement of the signature table approach which adopts a
somewhat more rigorous procedure.  Preliminary results with this scheme indicate that a substantial
improvement has been achieved.

                Appendix 1

        The Early Form of a Signature Table

        For those not familiar with the use of signature tables as used by Samuel in programs which played
the game of checkers, the concept is best illustrated (Fig.1) by an arrangement of tables used in the
program.  There are 27 input terms.  Each term evaluates a specific aspect of a board situation and it is
quantized into a limited but adequate range of values, 7, 5, and 3, in this case.  The terms are divided into 9
sets with 3 terms each, forming the 9 first level tables.  Outputs from the first level tables are quantized
to 5 levels and combined into 3 second level tables and, finally, into one third-level table whose output
represents the figure of merit of the board in question.
        A signature table has an entry for every possible combination of the input vector.  Thus there are
7*5*3 or 105 entries in each of the first level tables.  Training consists of accumulating two counts for
each entry during a training sequence.  Count A is incremented when the current input vector represents
a preferred move and count D is incremented when it is not the preferred move.  The output from the table
is computed as a correlation coefficient
                        C = (A-D)/(A+D).
The figure of merit for a board is simply the coefficient obtained as the output from the final table.
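
        A minimal sketch of this training and evaluation rule for one first-level table follows (modern
Python, with invented names; the original checker program is of course not written this way).  Each entry
keeps the two counts A and D and reports the correlation coefficient C = (A-D)/(A+D).

class SignatureTable:
    def __init__(self, ranges=(7, 5, 3)):
        self.ranges = ranges
        size = 1
        for r in ranges:
            size *= r                      # 7*5*3 = 105 entries for a first-level table
        self.A = [0] * size                # counts for "preferred move"
        self.D = [0] * size                # counts for "not the preferred move"

    def index(self, inputs):
        # Each entry corresponds to one possible combination of the quantized inputs.
        i = 0
        for value, r in zip(inputs, self.ranges):
            i = i * r + value
        return i

    def train(self, inputs, preferred):
        if preferred:
            self.A[self.index(inputs)] += 1
        else:
            self.D[self.index(inputs)] += 1

    def output(self, inputs):
        a, d = self.A[self.index(inputs)], self.D[self.index(inputs)]
        return 0.0 if a + d == 0 else (a - d) / (a + d)   # correlation coefficient C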

                Appendix 2

        Initial Form of Signature Table for Speech Recognition

        The signature tables, as used in speech recognition, must be particularized to allow for the
multi-category nature of the output.  Several forms of tables have been investigated.  The initial form
tested and used for the data presented in the attached paper uses tables consisting of two parts, a
preamble and the table proper.  The preamble contains: (1) space for saving a record of the current and
recent output reports from the table, (2) identifying information as to the specific type of table, (3) a
parameter that identifies the desired output from the table and that is used in the learning process, (4) a
gating parameter specifying the input that is to be used to gate the table, (5) the gating level to be used,
and (6) parameters that identify the sources of the normal inputs to the table.
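
        The sketch below restates the preamble just described as a modern Python structure.  The field
names and types are our own invention; the proposal lists only the kinds of information stored, not a
layout.

from dataclasses import dataclass, field
from typing import List

@dataclass
class TablePreamble:
    # (1) record of the current and recent output reports (up to twelve past outputs are kept)
    recent_outputs: List[int] = field(default_factory=list)
    # (2) identifying information as to the specific type of table
    table_type: str = ""
    # (3) parameter identifying the desired output, used in the learning process
    desired_output: int = 0
    # (4) which input is to be used to gate the table
    gate_input: int = 0
    # (5) the gating level to be used
    gate_level: int = 0
    # (6) sources of the normal inputs to the table
    input_sources: List[int] = field(default_factory=list)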

        All inputs are limited in range and specify either the absolute level of some basic property or more
usually the probability of some property being present.  These inputs may be from the original acoustic
input or they may be the outputs of other tables.  If from other tables they may be for the current time
step or for earlier time steps (subject to practical limits as to the number of time steps that are saved).

        The output, or outputs, from each table are similarly limited in range and specify, in all cases, a
probability that some particular significant feature, phonette, phoneme, word segment, word or phrase is
present.

        We are limiting the range of inputs and outputs to values specified by 3 bits and the number of
entries per table to 64, although this choice of values is a matter to be determined by experiment.  We are
also providing for any of the following input combinations: (1) one input of 6 bits, (2) two inputs of 3 bits
each, (3) three inputs of 2 bits each, and (4) six inputs of 1 bit each.  The uses to which these different
forms are put will be described later.
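
        Each of the four allowed combinations supplies six bits in total, which is just enough to address one
of the 64 table entries.  One possible packing (our own illustration; the proposal does not spell one out) is
simply to concatenate the input fields into a single 6-bit index:

def pack(inputs, bits_per_input):
    """Concatenate fixed-width inputs into a single 6-bit table index (0..63)."""
    assert len(inputs) * bits_per_input == 6
    index = 0
    for value in inputs:
        assert 0 <= value < (1 << bits_per_input)
        index = (index << bits_per_input) | value
    return index

pack([53], 6)                    # (1) one input of 6 bits
pack([5, 2], 3)                  # (2) two inputs of 3 bits each
pack([3, 0, 2], 2)               # (3) three inputs of 2 bits each
pack([1, 0, 1, 1, 0, 1], 1)      # (4) six inputs of 1 bit each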

        The body of each table contains entries corresponding to every possible combination of the allowed
input parameters.  Each entry in the table actually consists of several parts.  There are fields assigned to
accumulate counts of the occurrences of incidents in which the specifying input values coincided with the
different desired outputs from the table as found during previous learning sessions and there are fields
containing the summarized results of these learning sessions, which are used as outputs from the table.
The outputs from the tables can then express to the allowed accuracy all possible functions of the input
parameters.

Operation in the Training Mode

        When operating in the training mode the program is supplied with a sequence of stored utterances
with accompanying phonetic transcriptions.  Each segment of the incoming speech signal is analysed
(Fourier transforms or inverse filter equivalent) to obtain the necessary input parameters for the lowest
level tables in the signature table hierarchy.  At the same time reference is made to a table of phonetic
"hints" which prescribes the desired outputs from each table corresponding to all possible phonemic
inputs.  The signature tables are then processed.

        The processing of each table is done in two steps, one process at each entry to the table and the
second only periodically.  The first process consists of locating a single entry line within the table as
specified by the inputs to the table and adding a 1 to the appropriate field to indicate the presence of the
property specified by the hint table as corresponding to the phoneme specified in the phonemic transcription.
At this time a report is also made as to the table's output as determined from the averaged results of
previous learning so that a running record may be kept of the performance of the system.  At periodic
intervals all tables are updated to incorporate recent learning results.  To make this process easily
understandable, let us restrict our attention to a table used to identify a single significant feature, say
Voicing.  The hint table will identify whether or not the phoneme currently being processed is to be
considered voiced.  If it is voiced, a 1 is added to the "yes" field of the entry line located by the normal
inputs to the table.  If it is not voiced, a 1 is added to the "no" field.  At updating time the output that this
entry will subsequently report is determined by dividing the accumulated sum in the "yes" field by the
sum of the numbers in the "yes" and the "no" fields, and reporting this quantity as a number in the range
from 0 to 7.  Actually the process is a bit more complicated than this and it varies with the exact type of
table under consideration, as reported in detail in Appendix B.  Outputs from the signature tables are not
probabilities, in the strict sense, but are the statistically-arrived-at odds based on the actual learning
sequence.
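
        A condensed sketch of the two-step process for the Voicing example follows (modern Python,
invented names; the per-table variations and odds-style outputs mentioned above are deliberately left
out).  Step one accumulates the "yes"/"no" counts against the hint at every table reference; step two,
done only periodically, folds the counts into the reported output in the range 0 to 7.

class FeatureTable:
    def __init__(self, entries=64):
        self.yes = [0] * entries       # hint said the feature (e.g. Voicing) was present
        self.no = [0] * entries        # hint said it was absent
        self.report = [0] * entries    # output in the range 0..7, refreshed at update time

    def accumulate(self, entry, hinted_present):
        # Step 1: done at every entry to the table during training.
        if hinted_present:
            self.yes[entry] += 1
        else:
            self.no[entry] += 1
        return self.report[entry]      # report based on the averaged results of previous learning

    def update(self):
        # Step 2: done only periodically, incorporating recent learning results.
        for i, (y, n) in enumerate(zip(self.yes, self.no)):
            if y + n:
                self.report[i] = round(7 * y / (y + n))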

        The preamble of the table has space for storing twelve past outputs.  An input to a table can be
delayed to that extent.  This table relates outcomes of previous events with the present hint, the learning
input.  A certain amount of context dependent learning is thus possible, with the limitation that the
specified delays are constant.

        The interconnected hierarchy of tables forms a network which runs incrementally, in steps
synchronous with the time window over which the input signal is analysed.  The present window width is
set at 12.8 ms. (256 points at 20 K samples/sec.) with an overlap of 6.4 ms.  Inputs to this network are the
parameters abstracted from the frequency analyses of the signal, and the specified hint.  The outputs of the
network could be either the probability attached to every phonetic symbol or the output of a table
associated with a feature such as voiced, vowel, etc.  The point to be made is that the output generated for a
segment is essentially independent of its contiguous segments.  The dependency achieved by using delays
in the inputs is invisible to the outputs.  The outputs thus report the best estimate of what the current
acoustic input is with no relation to the past outputs.  Relating the successive outputs along the time
dimension is realised by counters.
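
        The window arithmetic stated above works out as follows (a small sketch in modern Python; the
slicing function and its names are ours, only the figures are from the text): 256-point windows at 20,000
samples per second are 12.8 ms long and are stepped by 128 samples, giving the 6.4 ms overlap.

def windows(samples, size=256, step=128):
    # Successive 256-point windows, advanced 128 points at a time (half-overlap).
    for start in range(0, len(samples) - size + 1, step):
        yield samples[start:start + size]

window_ms = 1000 * 256 / 20_000   # 12.8 ms per window
step_ms   = 1000 * 128 / 20_000   # 6.4 ms step, hence 6.4 ms of overlap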

The Use of COUNTERS

        The transition from initial segment space to event space is made possible by means of COUNTERS
which are summed and reinitiated whenever their inputs cross specified threshold values, being triggered
on when the input exceeds the threshold and off when it falls below.  Momentary spikes are eliminated by
specifying a time hysteresis, the number of consecutive segments for which the input must be above the
threshold.  The output of a counter provides information about starting time, duration and average input for
the period it was active.
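
        A hedged sketch of such a counter follows (modern Python, structure and names invented): it turns
on when its input crosses the threshold and has stayed there for the hysteresis period, turns off when the
input falls below, and then reports starting time, duration and average input.

class Counter:
    def __init__(self, threshold, hysteresis=2):
        self.threshold, self.hysteresis = threshold, hysteresis
        self.run, self.active, self.report = [], False, None

    def step(self, t, value):
        if value >= self.threshold:
            self.run.append((t, value))
            if not self.active and len(self.run) >= self.hysteresis:
                self.active = True                     # momentary spikes never get this far
        else:
            if self.active:
                start = self.run[0][0]
                avg = sum(v for _, v in self.run) / len(self.run)
                self.report = (start, len(self.run), avg)   # starting time, duration, average input
            self.run, self.active = [], False          # reinitiate for the next crossing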

        Since a counter can reference a table at any level in the hierarchy of tables, it can reflect any
desired degree of information reduction.  For example, a counter may be set up to show a section of
speech to be a vowel, a front vowel or the vowel /I/.  The counters can be looked upon as representing a
mapping of parameter-time space into a feature-time space, or at a higher level a symbol-time space.  It may
be useful to carry along the feature information as a back up in those situations where the symbolic
information is not acceptable to syntactic or semantic interpretation.

        In the same manner as the tables, the counters run completely independently of each other.  In a
recognition run the counters may overlap in arbitrary fashion, may leave gaps where no counter has
been triggered or may not line up nicely.  A properly segmented output, where the consecutive sections
are in time sequence and are neatly labelled, is essential for further processing.  This is achieved by
registering the instants when the counters are triggered or terminated to form time segments called
events.

        An event is the period between successive activations or terminations of any counter.  An event
shorter than a specified time is merely ignored.  A record of event durations and up to three active
counters, ordered according to their probability, is maintained.
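
        The sketch below shows one way this event record could be formed (modern Python; the function
and its arguments are invented for illustration): counter trigger and termination instants partition the
time axis, events shorter than a minimum duration are ignored, and up to three active counters are kept
in order of probability.

def make_events(boundaries, active_counters, min_duration=2):
    """boundaries: sorted counter trigger/termination times.
    active_counters(start, end): list of (probability, label) active in that interval."""
    events = []
    for start, end in zip(boundaries, boundaries[1:]):
        if end - start < min_duration:
            continue                                   # too short: merely ignored
        best = sorted(active_counters(start, end), reverse=True)[:3]
        events.append((start, end - start, best))      # duration plus up to three labels
    return events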

        An event resulting from the processing described so far represents a phonette - one of the basic
speech categories defined as hints in the learning process.  It is only an estimate of closeness to a speech
category, based on past learning.  Also, each category has a more-or-less stationary spectral
characterisation.  Thus a category may have a phonemic equivalent, as in the case of vowels; it may be
common to a phoneme class, as for the voiced or unvoiced stop gaps; or it may be subphonemic, as a T-burst
or a K-burst.  The choices are based on acoustic expediency, i.e. optimisation of the learning, rather than
any linguistic considerations.  However, higher level interpretive programs may best operate on inputs
resembling a phonemic transcription.  The contiguous events may be coalesced into phoneme-like units using
dyadic or triadic probabilities and acoustic-phonetic rules particular to the system.  For example, a period of
silence followed by a type of burst or a short friction may be combined to form the corresponding stop.  A
short friction or a burst following a nasal or a lateral may be called a stop even if the silence period is
short or absent.  Clearly these rules must be specific to the system, based on the confidence with which
durations and phonette categories are recognised.
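
        As an illustration of the kind of system-specific rule mentioned above (the rule set and the labels
are invented for the sketch, not taken from the system), a silence followed by a burst or a short friction
can be coalesced into the corresponding stop:

def coalesce(events):
    """events: list of (label, duration) pairs in time order."""
    out = []
    for label, dur in events:
        if out and out[-1][0] == "silence" and label in ("T-burst", "K-burst", "short-friction"):
            out[-1] = ("stop", out[-1][1] + dur)       # e.g. silence + T-burst -> a /t/-like stop
        else:
            out.append((label, dur))
    return out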

        While it would be possible to extend this bottom-up approach still further, it seems reasonable to
break off at this point and revert to a top-down approach from here on.  The real difference in the overall
system would then be that the top-down analysis would deal with the outputs from the signature table
section as its primitives rather than with the outputs from the initial measurements either in the time
domain or in the frequency domain.  In the case of inconsistencies the system could either refer to the
second choices retained within the signature tables or, if need be, could always go clear back to the input
parameters.  The decision as to how far to carry the initial bottom-up analysis must depend upon the
relative cost of this analysis, both in complexity and processing time, and the certainty with which it can be
performed, as compared with the costs associated with the rest of the analysis and the certainty with
which it can be performed, taking due notice of the costs in time of recovering from false starts.